
ARROW-13330: [Go][Parquet] Add the rest of the Encoding package #10716

Closed
zeroshade wants to merge 5 commits from the parquet-encoding-p2 branch

Conversation

zeroshade
Member

@emkornfield Thanks for merging the previous PR #10379

Here are the remaining files that we pulled out of that PR to shrink it down, including all the unit tests for the Encoding package.

@zeroshade
Member Author

@emkornfield @sbinet Bump for visibility to get reviews

@emkornfield
Contributor

Sorry, busy week this week; I'll try to get to it by EOW or sometime next week.

@zeroshade
Member Author

@emkornfield bump

@emkornfield
Contributor

Sorry, I've had less time than I would have liked recently for Arrow reviews; I'll try to get to this soon.

)

func BenchmarkPlainEncodingBoolean(b *testing.B) {
	for sz := MINSIZE; sz < MAXSIZE+1; sz *= 2 {
Contributor

Isn't there a built-in construct in Go benchmarks for adjusting batch size?

Member Author

Nope. b.N is the number of iterations to run under the benchmarking timer, so this little loop creates a separate sub-benchmark for each of the batch sizes.
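
For reference, the pattern looks roughly like this (a minimal sketch with hypothetical MINSIZE/MAXSIZE values and a stand-in encoder function; the real benchmark calls into the encoding package instead):

```go
package encoding_test

import (
	"fmt"
	"testing"
)

const (
	MINSIZE = 1024
	MAXSIZE = 65536
)

// plainEncodeBools is a hypothetical stand-in for the real PLAIN boolean encoder.
func plainEncodeBools(vals []bool) []byte {
	out := make([]byte, 0, len(vals))
	for _, v := range vals {
		if v {
			out = append(out, 1)
		} else {
			out = append(out, 0)
		}
	}
	return out
}

func BenchmarkPlainEncodingBoolean(b *testing.B) {
	for sz := MINSIZE; sz < MAXSIZE+1; sz *= 2 {
		// each batch size gets registered as its own sub-benchmark; inside it,
		// b.N is the iteration count chosen by the benchmark harness
		b.Run(fmt.Sprintf("len %d", sz), func(b *testing.B) {
			values := make([]bool, sz)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				plainEncodeBools(values)
			}
		})
	}
}
```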

}

// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
// the buffer and return it yet, appending these encoded values. Returns the number
Contributor

Does it ever return fewer than the number of values provided in lvls? If so, please document it (if not, maybe still note that this is simply for the API consumer's convenience).

Member Author

It should always return len(lvls); if it returns fewer, that means it encountered an error while encoding. I'll add that to the comment.

Contributor

So it is up to users to check that? Should it propagate an error instead?

Member Author

Currently this is an internal-only package, so it isn't exposed for non-internal consumers to call; the column writers check the return value and propagate an error if it doesn't match. Alternately, I could modify the underlying encoders to have Put return an error instead of just a bool and propagate that. I believe it currently returns true/false for success/failure out of convenience, and I never got around to having it return a proper error.

I'll take a look at how big such a change would be.

Member Author

Updated this change to add error propagation to all the necessary spots (and all the subsequent calls and dependencies), so consumers no longer have to rely on checking the number of values returned and can easily see if an error was returned by the encoders.
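
Roughly, the difference in calling convention looks like this (a sketch with hypothetical interface shapes and error text, not the exact signatures in the package):

```go
package levels

import "errors"

// old style: success is signalled only by the returned count
type countReturningEncoder interface {
	EncodeNoFlush(lvls []int16) int
}

// new style: the encoder propagates an error directly
type errorReturningEncoder interface {
	EncodeNoFlush(lvls []int16) (int, error)
}

// before: the column writer has to compare the count against len(lvls)
func writeLevelsOld(enc countReturningEncoder, lvls []int16) error {
	if n := enc.EncodeNoFlush(lvls); n != len(lvls) {
		return errors.New("level encoding failed")
	}
	return nil
}

// after: any encoding error surfaces without count checking
func writeLevelsNew(enc errorReturningEncoder, lvls []int16) error {
	_, err := enc.EncodeNoFlush(lvls)
	return err
}
```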

@zeroshade
Member Author

@emkornfield bump. I've responded to and/or addressed all of the comments so far. 😄


func TestDeltaByteArrayEncoding(t *testing.T) {
	test := []parquet.ByteArray{[]byte("Hello"), []byte("World"), []byte("Foobar"), []byte("ABCDEF")}
	expected := []byte{128, 1, 4, 4, 0, 0, 0, 0, 0, 0, 128, 1, 4, 4, 10, 0, 1, 0, 0, 0, 2, 0, 0, 0, 72, 101, 108, 108, 111, 87, 111, 114, 108, 100, 70, 111, 111, 98, 97, 114, 65, 66, 67, 68, 69, 70}
Contributor

Where do the expected values come from?

Member Author

The Delta Byte Array encoding was already implemented in another Parquet library, so I stole the expected values from their unit tests and also did some manual confirmation to make sure, to the best of my knowledge, that they are correct.

	Reset()
	// Size returns the current number of unique values stored in the table,
	// including whether or not a null value has been passed in using GetOrInsertNull
	Size() int
Contributor

Is int 32 bits here? Does it pay to have errors returned from the insertion operations if they exceed that range?

Member Author

int is implementation-defined: on 32-bit platforms it will be 32 bits, and on 64-bit platforms it will be 64 bits.

In this memo table, I assumed it was unlikely there'd ever be billions of elements, so a check for exceeding the range of an int didn't seem necessary. Personally, I'd prefer not to add the extra check inside the insertion operation: it's a critical path that is likely to sit inside a tight loop, so rather than checking whether the new size will exceed math.MaxInt on every insert, I'd rather just document that the memo table is limited to MaxInt unique values. Thoughts?
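
For illustration, the documentation-only approach might look roughly like this (hypothetical wording and method signature, just to show the trade-off):

```go
package memo

// MemoTable is an illustrative sketch of the interface under discussion.
type MemoTable interface {
	// GetOrInsert returns the index of val, inserting it if it is not already
	// present. The table is limited to math.MaxInt unique values; since
	// insertion sits on a hot path inside tight encoding loops, no overflow
	// check is performed here.
	GetOrInsert(val interface{}) (idx int, found bool, err error)
	// Size returns the current number of unique values stored in the table,
	// including a null value inserted via GetOrInsertNull.
	Size() int
}
```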

	MemoTable
	// ValuesSize returns the total number of bytes needed to copy all of the values
	// from this table.
	ValuesSize() int
Contributor

Does this include the overhead for the string lengths?

Member Author

No, it's just the raw bytes of the strings as they're stored, like in an Arrow array (I back the binary memo table with a BinaryBuilder and just call DataLen on it). This is specifically used for copying the raw values out as a single chunk of memory, which is why the offsets are stored and copied out separately.
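
In sketch form (illustrative struct and constructor names; the real table also maintains the hash lookup, this just shows the BinaryBuilder/DataLen relationship described above):

```go
package memo

import (
	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

// BinaryMemoTable sketch: the values live in a BinaryBuilder exactly as they
// would in an Arrow binary array (raw bytes plus separate offsets).
type BinaryMemoTable struct {
	builder *array.BinaryBuilder
}

func NewBinaryMemoTable(mem memory.Allocator) *BinaryMemoTable {
	return &BinaryMemoTable{builder: array.NewBinaryBuilder(mem, arrow.BinaryTypes.Binary)}
}

// ValuesSize is just the raw value bytes (BinaryBuilder.DataLen), with no
// per-string length overhead, so the values can be copied out as one
// contiguous chunk; the offsets are stored and copied out separately.
func (t *BinaryMemoTable) ValuesSize() int { return t.builder.DataLen() }
```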

@emkornfield
Contributor

OK, I've gone through most of it now. I'm a bit surprised a custom hash table implementation performs best (I'd expected Go's implementation to be highly tuned); any theories as to why?

@zeroshade
Member Author

@emkornfield The performance difference is actually entirely based on the data and types (which is why I left both implementations in here).

My current theory is that the custom hash table implementation (which I took from the C++ memo table implementation) simply handles smaller values in an optimized way, resulting in significantly fewer allocations.

With go1.16.1 on my local laptop:

goos: windows
goarch: amd64
pkg: github.com/apache/arrow/go/parquet/internal/encoding
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkMemoTableFloat64/100_unique_n_65535/go_map-12         	     100	  12464686 ns/op	    7912 B/op	      15 allocs/op
BenchmarkMemoTableFloat64/100_unique_n_65535/xxh3-12           	     230	   5325610 ns/op	    7680 B/op	       2 allocs/op
BenchmarkMemoTableFloat64/1000_unique_n_65535/go_map-12        	      85	  14819479 ns/op	  123337 B/op	      70 allocs/op
BenchmarkMemoTableFloat64/1000_unique_n_65535/xxh3-12          	     186	   5388082 ns/op	  130560 B/op	       4 allocs/op
BenchmarkMemoTableFloat64/5000_unique_n_65535/go_map-12        	      79	  16167963 ns/op	  493125 B/op	     195 allocs/op
BenchmarkMemoTableFloat64/5000_unique_n_65535/xxh3-12          	     132	   7631678 ns/op	  523776 B/op	       5 allocs/op
BenchmarkMemoTableInt32/100_unique_n_65535/xxh3-12             	     247	   4648034 ns/op	    5120 B/op	       2 allocs/op
BenchmarkMemoTableInt32/100_unique_n_65535/go_map-12           	     602	   1885666 ns/op	    6468 B/op	      17 allocs/op
BenchmarkMemoTableInt32/1000_unique_n_65535/xxh3-12            	    1429	    895664 ns/op	   87040 B/op	       4 allocs/op
BenchmarkMemoTableInt32/1000_unique_n_65535/go_map-12          	     632	   1893139 ns/op	  104802 B/op	      72 allocs/op
BenchmarkMemoTableInt32/5000_unique_n_65535/xxh3-12            	    1020	   1295938 ns/op	  349184 B/op	       5 allocs/op
BenchmarkMemoTableInt32/5000_unique_n_65535/go_map-12          	     489	   2457437 ns/op	  420230 B/op	     187 allocs/op
BenchmarkMemoTable/100_unique_len_32-32_n_65535/xxh3-12        	     398	   2990361 ns/op	   16904 B/op	      23 allocs/op
BenchmarkMemoTable/100_unique_len_32-32_n_65535/go_map-12      	     428	   2811322 ns/op	   19799 B/op	      28 allocs/op
BenchmarkMemoTable/100_unique_len_8-32_n_65535/xxh3-12         	     356	   3463212 ns/op	   16904 B/op	      23 allocs/op
BenchmarkMemoTable/100_unique_len_8-32_n_65535/go_map-12       	     380	   3149174 ns/op	   19781 B/op	      28 allocs/op
BenchmarkMemoTable/1000_unique_len_32-32_n_65535/xxh3-12       	     366	   3188208 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTable/1000_unique_len_32-32_n_65535/go_map-12     	     361	   9588561 ns/op	  211407 B/op	      67 allocs/op
BenchmarkMemoTable/1000_unique_len_8-32_n_65535/xxh3-12        	     336	   3529636 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTable/1000_unique_len_8-32_n_65535/go_map-12      	     326	   3676726 ns/op	  211358 B/op	      67 allocs/op
BenchmarkMemoTable/5000_unique_len_32-32_n_65535/xxh3-12       	     252	   4702015 ns/op	  992584 B/op	      42 allocs/op
BenchmarkMemoTable/5000_unique_len_32-32_n_65535/go_map-12     	     255	   4884533 ns/op	 1131141 B/op	     181 allocs/op
BenchmarkMemoTable/5000_unique_len_8-32_n_65535/xxh3-12        	     235	   4814247 ns/op	  722248 B/op	      41 allocs/op
BenchmarkMemoTable/5000_unique_len_8-32_n_65535/go_map-12      	     244	   5340692 ns/op	  860673 B/op	     179 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_32-32/go_map-12    	    4509	    271040 ns/op	  211432 B/op	      67 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_32-32/xxh3-12      	    8245	    150411 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_8-32/go_map-12     	    4552	    255188 ns/op	  211443 B/op	      67 allocs/op
BenchmarkMemoTableAllUnique/values_1024_len_8-32/xxh3-12       	    9828	    148377 ns/op	  176200 B/op	      32 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_32-32/go_map-12   	     100	  11416073 ns/op	 6324082 B/op	    1176 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_32-32/xxh3-12     	     222	   5578530 ns/op	 3850569 B/op	      49 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_8-32/go_map-12    	     100	  11123497 ns/op	 6323152 B/op	    1171 allocs/op
BenchmarkMemoTableAllUnique/values_32767_len_8-32/xxh3-12      	     212	   6094342 ns/op	 3850569 B/op	      49 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_32-32/go_map-12   	      44	  25062816 ns/op	12580560 B/op	    2384 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_32-32/xxh3-12     	      74	  15704849 ns/op	10430028 B/op	      53 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_8-32/go_map-12    	      40	  26667808 ns/op	12580008 B/op	    2381 allocs/op
BenchmarkMemoTableAllUnique/values_65535_len_8-32/xxh3-12      	      81	  14417075 ns/op	10430024 B/op	      53 allocs/op

The ones labeled xxh3 are my custom implementation and go_map is the implementation based on Go's builtin map. If you look closely at the results, in most cases the xxh3-based implementation is significantly faster, sometimes even twice as fast, with far fewer allocations per loop (for example, in the binary case with all unique values you can see 2384 allocations in the go-map-based implementation vs 53 in my custom implementation, along with a ~37% performance improvement over the go-map-based implementation).

But if we look at some other cases, for example the 100_unique_len_32-32_n_65535 and 100_unique_len_8-32_n_65535 cases (65535 binary strings of length 32, or of lengths between 8 and 32, with exactly 100 unique values among them), the go-map-based implementation is actually slightly more performant despite a few more allocations. The same thing happens with the Int32 memo table with only 100 unique values over 65535 values, but when we increase to 1000 or 5000 unique values my custom one does better. This seems to indicate that the builtin Go map is faster in cases with a lower cardinality of unique values, while my custom implementation is more performant at inserting new values. The exception is the Float64 case, where my custom implementation is faster across the board, even with only 100 unique values, which I attribute to xxh3 hashing smaller values faster.

TL;DR: In most cases the custom implementation is faster, but in some cases with a lower cardinality of unique values (specifically int32 and binary strings) the implementation built on Go's map can be more performant.
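
(For reference, the table above is the output of the package's Go benchmarks run with -benchmem, which is what adds the B/op and allocs/op columns alongside ns/op.)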

@zeroshade
Member Author

zeroshade commented Aug 11, 2021

@emkornfield Just to tack on here, another interesting view is a flame graph of the CPU profile for the BenchmarkMemoTableAllUnique case, benchmarking just the binary strings. The largest difference between the two is that the builtin-Go-map-based implementation uses a map[string]int to map strings to their memo index, whereas the custom implementation uses an Int32HashTable to map the hash of the string (computed with the custom hash implementation) to the memo index.

[flame graph comparing the CPU profiles of the go-map-based and xxh3-based memo table implementations]

Looking at the flame graph, you can see that a larger proportion of the CPU time for the builtin-map-based implementation is spent in the map itself (performing the hashes, accessing, growing, and allocating) versus adding the strings to the BinaryBuilder, while in the xxh3-based custom implementation a smaller proportion of the time is spent actually performing the hashing and the lookups/allocations. In the benchmarks I pass 0 when creating the new memo table to avoid pre-allocating, in order to make the comparison with the Go map implementation closer, since to my knowledge there's no way to pre-allocate a size for the builtin Go map. But if I change that and have it actually use reserve to pre-allocate space, the difference becomes even more pronounced.
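
For anyone following along, here's a stripped-down sketch of the structural difference being profiled. It's illustrative only: collision checks, null handling, and the real open-addressing Int32HashTable are omitted (a plain map keyed by the 64-bit hash stands in for it), and the BinaryBuilder appends are reduced to comments.

```go
package memo

// Builtin-map variant: the map keys directly on the string, so the runtime
// hashes and stores a copy of every key on top of the BinaryBuilder data.
type goMapMemo struct {
	index map[string]int
}

func (m *goMapMemo) getOrInsert(v string) int {
	if idx, ok := m.index[v]; ok {
		return idx
	}
	idx := len(m.index)
	m.index[v] = idx
	// ...append v to the BinaryBuilder here...
	return idx
}

// Custom variant: hash the raw bytes once with xxh3 and key the table on that
// 64-bit hash; the string bytes themselves only land in the BinaryBuilder.
type xxh3Memo struct {
	table map[uint64]int
	hash  func([]byte) uint64
}

func (m *xxh3Memo) getOrInsert(v []byte) int {
	h := m.hash(v)
	if idx, ok := m.table[h]; ok {
		// the real table also compares stored values to guard against collisions
		return idx
	}
	idx := len(m.table)
	m.table[h] = idx
	// ...append v to the BinaryBuilder here...
	return idx
}
```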

@zeroshade
Member Author

Just to chime in with one last piece on this: while it's extremely interesting that a custom hash table outperforms Go's builtin map in many cases, remember that in absolute terms we're still talking about differences between 1ms and 0.1ms. Unless you're using it in a tight loop with a ton of entries/lookups, you're probably better off with Go's builtin map, simply because it's a simpler, built-in implementation with no external dependencies and will be more than sufficiently performant in most cases. But for this low-level handling of dictionary encoding in Parquet, the performance difference becomes significant at the scale this will be used, which makes the custom implementation preferable.

@emkornfield
Contributor

+1 merging. Thank you @zeroshade

asfgit pushed a commit that referenced this pull request Sep 13, 2021
Here's the next chunk of code following the merging of #10716

Thankfully the metadata package is actually relatively small compared to everything else so far.

@emkornfield @sbinet @nickpoorman for visibility

Closes #10951 from zeroshade/parquet-metadata-package

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
@zeroshade deleted the parquet-encoding-p2 branch September 27, 2021 17:14